ABSTRACT

This study analyzed the relationship between 
selecting critical errors (choices that would be dangerous to 
patients) and conventional test scores on a medical school certifying 
examination that included three item formats: regular and weighted 
multiple-choice questions and patient management problems. Data from 
a Clinical Certifying Examination administered to 279 seniors at the
University of Illinois College of Medicine were analyzed. It was
found that while there were significant negative correlations between
test scores and number of critical errors made across the three 
different item formats, there were nonetheless students who passed 
the examination although they made a relatively large number of 
critical errors. Implications for teaching and testing are discussed. 
Four tables present data on critical errors. (SLD) 
ABSTRACT

This study analyzed the relationship between selecting critical errors
(choices that would be dangerous to patients) and conventional test scores on
a medical school certifying examination that included three item formats: 
regular and weighted multiple choice questions and patient management 
problems. It was found that while there were significant negative 
correlations between test scores and number of critical errors made across the 
three different item formats, there were nonetheless students who passed the
examination although they made a relatively large number of critical errors.
Implications for teaching and testing are discussed. 
THE SELECTION OF CRITICAL ERRORS ON A MEDICAL SCHOOL CERTIFYING EXAMINATION

Objectives of the Study

There is concern in the academic medical community that some students may 
meet requirements for medical school graduation and yet may be prone to making 
dangerous mistakes which could jeopardize their patients' lives or retard 
recovery. The faculty committee charged with the responsibility for 
developing a certifying examination for the University of Illinois College of 
Medicine (UICOM) has, over the years, viewed the graduation of students who
make dangerous mistakes on this examination with alarm. 

This study was therefore undertaken to answer the following questions: 
1) Is there a correlation between overall test performance and the number of 
critical errors made? 2) Is there consistency in the selection of critical 
errors across different item formats? 3) Are there examinees who pass the
test and yet make a significant number of critical errors? and 4) Are there
examinees who fail the examination but who make only a modest number of such
errors?

Review of the Literature 

Three studies have recently appeared that address this issue. Grosse 
(1986) reports on a study of the selection of dangerous responses to multiple 
choice items on the American Board of Orthopaedic Surgery's 1983 and 1984 
certifying examinations. The mean number of dangerous options selected was
low: 1.4 (S.D. = 1.3) with 31 possible in 1983 (n = 548) and 2.8 (S.D. = 1.9)
with 66 possible in 1984 (n = 746). The Pearson correlations between total 
test scores and number of dangerous options selected exceeded -1.00 for both 
years after correction for attenuation due to unreliability. Comparisons were 
made between different examinee subgroups. U.S. and Canadian medical school 
graduates selected significantly fewer dangerous responses than did foreign
medical school graduates, and examinees taking the test for the first time 
selected significantly fewer dangerous responses than did repeaters. These 
results held across both the 1983 and 1984 examinations. 
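
The correction for attenuation mentioned above divides the observed correlation by the square root of the product of the two score reliabilities, which is why a corrected value can exceed 1.00 in magnitude when one of the scores is unreliable. The following is a minimal Python sketch of that formula; the function name and the input values are hypothetical illustrations and are not figures from Grosse (1986).

```python
import math


def disattenuate(r_xy: float, rel_x: float, rel_y: float) -> float:
    """Spearman's correction for attenuation: r_xy / sqrt(rel_x * rel_y)."""
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)


# Hypothetical inputs: an observed correlation of -0.60 between a total
# score (reliability 0.90) and a short dangerous-option count
# (reliability 0.35). None of these values come from the study.
corrected = disattenuate(-0.60, 0.90, 0.35)
print(round(corrected, 2))
```

With a low-reliability count score, the corrected coefficient falls outside the usual -1 to +1 range, which is an artifact of the correction rather than a true correlation.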

Because of the high negative correlations between total test scores and 
number of dangerous options selected, Grosse (1986) concludes that, "It seems 
unlikely that dangerous option scores could contribute any information about 
candidate ability not already contained in the total test scores." (p. 465) 
He adds that the study results do provide further evidence of the tests'
construct validity. Candidates generally select few dangerous options,
indicating there is a good match between their preparation and the test
content, and high-scoring examinees make fewer such choices than low-scoring
examinees.

Hankin, Lloyd, and Rovinelli (1987) also studied the selection of 
dangerous answers on four multiple choice specialty board examinations 
administered to 2,713 examinees between 1981 and 1983. A panel of experts
identified dangerous answers on 491 out of a total of 903 multiple choice
questions (46%). Scores were combined across the four examinations by
expressing the conventional percent correct scores in standard deviation units 
from the mean and the percent dangerous answer scores on a seven-point scale 
based on the number of dangerous answers selected compared to the total number 
of items on the test. Arbitrary pass/fail scores were set at 1.5 standard 
deviations below the mean for the conventional score and at three for the
dangerous answer score. 

Overall, the examinees chose dangerous responses for 8% of the items. 
Those who passed on the basis of their percent correct scores chose dangerous 
responses for 6% of the items, and those who failed selected 11%. When the
pass/fail rate was compared for the two scoring methods, it was found that the
percent with a failing conventional score and a passing dangerous answer score
was 3.2, and 10.4% had a passing conventional score and a failing dangerous
answer score. The authors conclude that scoring dangerous answers may provide
useful information about examinees, particularly those close to the pass/fail
score.

Slogoff and Hughes (1987) reach a different conclusion based on a study of 
2,449 candidates who took the American Board of Anesthesiology's 1983 written 
examination. A panel of experts identified 29 items with 43 dangerous
responses out of 175 multiple choice questions. The 1,036 candidates who
passed the test selected a mean of 1.6 (S.D. = 0.3; range = 0-7) dangerous
answers, and the 1,413 who failed selected a mean of 3.4 (S.D. = 0.4; range = 
0-10), a significant difference (p < .001). 

Selection of four or more dangerous responses was used to define a 
potentially dangerous group for further evaluation. There were 92 passing 
candidates and 631 failing candidates in this group. For the passing group,
their "potentially dangerous" status was not confirmed by ratings of residency
performance or by performance on the oral examination given subsequent to the
written examination (86 of the 92 took this examination). The authors
therefore conclude that "...implementation of alternate scoring by the
dangerous answer format would be unnecessarily punitive and unwarranted." (p.
630)

In summary, three studies have looked at the selection of dangerous 
responses on specialty board examinations. Two of these conclude that scoring 
in this manner is not helpful, one because of high negative correlations with
regular percent correct scores and the other because of lack of supporting
evidence from other information sources. The authors of the third study feel
that this approach to scoring may be useful, particularly for students close
to the pass/fail score.

Instrument and Methods

The data for this study were derived from a recent Clinical Certifying
Examination given to 279 seniors at the UICOM. All students are required to
pass this two-day examination prior to graduation. The examination consisted
of regular multiple choice questions (RMC), weighted multiple choice questions
(WMC), patient management problems (PMP), and a data gathering problem (DGP).

The regular multiple choice questions focus on data interpretation and
patient management rather than recall of factual information. The weighted
multiple choice questions also focus on patient management, but the options
are weighted from +8 to -8 depending on their appropriateness or
inappropriateness for the care of the patient at that point in time. The PMPs
utilize latent image printing technology, and each problem contains several
sections corresponding to various stages in the work-up and management of a
patient. Like the weighted multiple choice questions, the options are
weighted from +8 to -8 depending on their appropriateness or inappropriateness.
Data gathering problems are short answer items usually focusing on some
aspect of a patient work-up such as ordering diagnostic studies. There was
only one short DGP on this examination; it did not lend itself to the
identification of critical errors and was therefore not included in the
study.

Critical errors (CEs) on this examination were defined by the faculty
committee as options which, if chosen, would be likely to place the patient in
significant jeopardy. Examples of critical errors are performing a lumbar
puncture in the presence of papilledema; performing a gastroscopy in the
presence of probable unstable angina; and giving quinidine to a patient in
atrial fibrillation with a rapid ventricular response prior to controlling the
[...] Critical errors were initially identified by the full committee during its
review of materials selected for the examination. If agreement by consensus
was not evident, options were not included as constituting critical errors.
The questions were given a final review for conformity to the definition by
one of the authors, who is a member of the committee.

There were 295 RMCs (usually with 5 options), 48 WMCs (usually with 7
options), and 9 PMPs with a total of 580 options. As Table 1 indicates, there
was a total of 108 CEs across the three test formats: 55 RMC, 18 WMC, and 35
PMP. The possible number of CEs per examinee was 63, as some items contained
more than one CE. Of these, 21 were RMC, 14 WMC, and 28 PMP.

Student scores included a regular multiple choice score, a weighted
multiple choice score, a patient management problem score, and a total score.
The total score was a weighted combination of the part scores based on amount
of examination time. There were three critical error scores, one for each of
the three item formats, plus a total number of critical errors across item
formats.

The reliabilities for the different parts of the examination were .86 for
the RMC (Kuder-Richardson Formula 20), .49 for the weighted multiple choice
(Angoff Formula 12), and .70 for the PMPs (Angoff Formula 12).
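
For readers unfamiliar with Kuder-Richardson Formula 20, the coefficient reported for the RMC section can be sketched as follows for items scored 0 or 1. This is only an illustration of the standard KR-20 formula on a hypothetical toy data set; it is not the committee's scoring program, and Angoff Formula 12 is not shown.

```python
from statistics import pvariance


def kr20(responses: list[list[int]]) -> float:
    """KR-20 for a matrix of 0/1 item scores (rows = examinees, columns = items)."""
    n_people = len(responses)
    n_items = len(responses[0])
    # Variance of examinees' total scores (population form, matching p * q).
    total_var = pvariance([sum(row) for row in responses])
    # Sum of p * q over items, where p is the proportion answering correctly.
    pq_sum = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_people
        pq_sum += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - pq_sum / total_var)


# Hypothetical toy data: 5 examinees by 4 items (not data from the examination).
toy = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(toy), 3))
```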

To further analyze the relationship between total test performance and
number of CEs made, the examinees were divided into four subgroups. These
were: 1) those scoring below the minimum pass level (MPL) of 62% (n = 13); 2)
those scoring at or slightly above the MPL (62-65%; n = 23); 3) those scoring
between the mean and one standard deviation below the mean (66-72%; n = 108);
and 4) those scoring above the mean (73% and above; n = 135).
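
The subgroup definitions above amount to a simple classification rule on the total score. A minimal sketch is given below; the function name is hypothetical, and the handling of fractional scores between the stated whole-percent ranges (for example 65.5%) is an assumption, since the text reports the ranges in whole percents only.

```python
def score_group(total_pct: float) -> int:
    """Return the subgroup (1-4) for a total score given in percent."""
    if total_pct < 62:
        return 1  # below the minimum pass level
    if total_pct < 66:
        return 2  # at or slightly above the MPL (62-65%)
    if total_pct < 73:
        return 3  # between the mean and one S.D. below it (66-72%)
    return 4      # 73% and above


# Hypothetical scores, one per subgroup.
print([score_group(s) for s in (58, 63, 70, 81)])
```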

Results

The mean and standard deviation of the number of CEs selected by item type 
and by subgroup appear in Table 2. For the total group the mean number of CEs 
on the RMCs was 3.04 (S.D. = 1.69); on the WMCs it was 0.77 (S.D. = 0.79); and
it was 1.96 (S.D. = 1.48) for the PMPs. The mean number of total CEs across
item types was 5.77 (S.D. = 2.66).
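
As a consistency check on Table 2, the total-group means should equal the size-weighted average of the four subgroup means. The short sketch below recomputes them from the subgroup values in Table 2; the variable and function names are illustrative, and small discrepancies are expected because the tabled means are rounded to two decimals.

```python
# Subgroup sizes and subgroup means as given in Table 2 (Groups 1-4).
group_n = [13, 23, 108, 135]
group_means = {
    "RMC": [5.46, 4.00, 3.45, 2.32],
    "WMC": [1.15, 1.00, 0.90, 0.59],
    "PMP": [3.23, 2.83, 2.21, 1.49],
    "Total": [9.84, 7.83, 6.56, 4.39],
}


def weighted_mean(means: list[float], sizes: list[int]) -> float:
    """Average of subgroup means weighted by subgroup size."""
    return sum(m * n for m, n in zip(means, sizes)) / sum(sizes)


recomputed = {k: round(weighted_mean(v, group_n), 2) for k, v in group_means.items()}
print(recomputed)
```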

The range of CEs was 1 to 18. For the 13 failing students it was 6-18, 
and for the barely passing group (n = 23) it was 4-12. For the students with 
the highest scores in the class (n = 12), the number of CEs ranged from 1-5. 

There were 14 students with 11 or more CEs (2 standard deviations above 
the mean). Of these, four failed the total test, and ten passed. None of 
these ten scored above the mean, however, and four of them were in the barely 
passing group.

Correlations between the test scores and number of CEs appear in Table 3. 
There are significant negative correlations between all of the variables. The 
largest correlations are between RMC score and RMC CEs (-.50); Total score and
RMC CEs (-.46); PMP score and PMP CEs (-.59); RMC score and Total CEs (-.53);
PMP score and Total CEs (-.53); and total score and Total CEs (-.59).

Table 3 also contains correlations between critical errors on different 
item formats and correlations between test scores on different item formats. 
The correlations between critical errors on different item formats were low 
though significant (p < .03): .12 between RMC and WMC, .16 between RMC and
PMP, and .12 between WMC and PMP. The correlations between test scores on
different item formats were moderate: .55 between RMC and WMC, .52 between
RMC and PMP, and .35 between WMC and PMP (p < .001).
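
To see why correlations as small as .12 reach significance with 279 examinees, the usual t test for a Pearson correlation can be sketched as below. The normal approximation to the t distribution and the use of a one-tailed probability are assumptions made for this illustration; the paper does not state which test was used, although the reported p values appear consistent with one-tailed probabilities.

```python
import math


def r_to_t(r: float, n: int) -> float:
    """t statistic for testing a Pearson r against zero (n - 2 degrees of freedom)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


def one_tailed_p_normal(t: float) -> float:
    """Upper-tail probability from a normal approximation (reasonable for df = 277)."""
    return 0.5 * math.erfc(abs(t) / math.sqrt(2))


n_examinees = 279
for r in (0.12, 0.16):
    t = r_to_t(r, n_examinees)
    print(f"r = {r:.2f}  t = {t:.2f}  approx. one-tailed p = {one_tailed_p_normal(t):.3f}")
```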

In order to further explore the relationship between total test
performance and number of CEs made, ANOVA was used to compare the four
subgroups on the number of CEs selected, and the results appear in Table 4.
There were significant differences among the four groups on the number of CEs
by item type as well as total number of CEs.
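
The entries in Table 4 follow the usual one-way ANOVA arithmetic: each mean square is a sum of squares divided by its degrees of freedom, and the F ratio is the between-groups mean square divided by the within-groups mean square. The sketch below recomputes the values for the total number of critical errors from the sums of squares in Table 4; the function name is illustrative.

```python
import math


def anova_from_ss(ss_between: float, ss_within: float,
                  df_between: int, df_within: int) -> tuple[float, float, float]:
    """Return (MS between, MS within, F ratio) from sums of squares and d.f."""
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between, ms_within, ms_between / ms_within


# Total number of critical errors, values taken from Table 4.
ms_b, ms_w, f_ratio = anova_from_ss(637.5836, 1333.7198, 3, 275)
print(round(ms_b, 4), round(ms_w, 4), round(f_ratio, 3))

# The total-group S.D. follows from the total sum of squares over N - 1.
total_sd = math.sqrt((637.5836 + 1333.7198) / 278)
print(round(total_sd, 2))
```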

A posteriori comparisons were made using the Scheffe procedure with the
significance level set at .05. For RMC, Group 1 made significantly more CEs
than Groups 2, 3, and 4, and Groups 2 and 3 made significantly more CEs than
Group 4. For WMC, the only significant pairwise comparison was between Groups
3 and 4. Group 3 made significantly more CEs than Group 4. For PMP, Groups
1, 2, and 3 made significantly more CEs than Group 4. For Total CEs, Groups
1, 2, and 3 made significantly more CEs than Group 4, and Group 1 made
significantly more CEs than Group 3.
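
The Scheffe pairwise comparisons for total CEs can be approximated from the summary statistics alone. In the sketch below, the group sizes and means come from Table 2 and the within-groups mean square from Table 4; the critical value of 2.64 for F(3, 275) at the .05 level is an approximate tabled value supplied as an assumption, and the function name is illustrative. Because the tabled means are rounded, this is a rough check rather than a reproduction of the original analysis.

```python
from itertools import combinations

# Summary statistics for total critical errors (Tables 2 and 4).
n = {1: 13, 2: 23, 3: 108, 4: 135}
mean = {1: 9.84, 2: 7.83, 3: 6.56, 4: 4.39}
MS_WITHIN = 4.8499

# Assumption: approximate critical value of F(3, 275) at alpha = .05.
F_CRIT = 2.64


def scheffe_pairs(sizes, means, ms_within, f_crit):
    """Return the group pairs whose mean difference is significant by Scheffe's test."""
    threshold = (len(sizes) - 1) * f_crit
    significant = []
    for a, b in combinations(sorted(sizes), 2):
        # F for the pairwise contrast: squared difference over its error variance.
        f_pair = (means[a] - means[b]) ** 2 / (ms_within * (1 / sizes[a] + 1 / sizes[b]))
        if f_pair > threshold:
            significant.append((a, b))
    return significant


print(scheffe_pairs(n, mean, MS_WITHIN, F_CRIT))
```

Under these assumptions the significant pairs are Groups 1, 2, and 3 each versus Group 4, and Group 1 versus Group 3, which agrees with the comparisons reported for Total CEs.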

Educational Significance

Overall, the students made approximately six CEs on this examination:
three on RMC, one on WMC, and two on PMP. It is interesting to note that they
tended to make fewer CEs on the formats with weighted options, relative to the
total number possible (RMC = 21; WMC + PMP = 42). The students were aware
that penalties were attached to incorrect choices, and perhaps this
information made them more cautious.
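
The comparison above can be made concrete by expressing the mean number of CEs as a proportion of the number possible in each kind of format. The following sketch uses the total-group means from Table 2 and the possible counts from Table 1; the variable names are illustrative.

```python
# Possible CEs per examinee (Table 1) and total-group mean CEs (Table 2).
possible = {"RMC": 21, "WMC": 14, "PMP": 28}
mean_ces = {"RMC": 3.04, "WMC": 0.77, "PMP": 1.96}

# Unweighted format: regular multiple choice.
rmc_rate = mean_ces["RMC"] / possible["RMC"]

# Formats with weighted options: WMC and PMP combined.
weighted_rate = (mean_ces["WMC"] + mean_ces["PMP"]) / (possible["WMC"] + possible["PMP"])

print(f"RMC: {rmc_rate:.1%} of possible CEs selected")
print(f"WMC + PMP: {weighted_rate:.1%} of possible CEs selected")
```

On these figures the students selected roughly 14% of the possible CEs on the regular format and roughly 6.5% on the weighted formats.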

There were significant negative correlations between test scores and 
number of critical errors selected across the three different item formats
(regular and weighted multiple choice and patient management problems). The
correlations between the number of critical errors made on different item
formats were low although significantly different from zero. Each of the 13
students in the failing group made a number of CEs at or above the total
group mean. Of the 14 examinees who made a large number of critical errors
(11 or more), only four failed the test.

When the faculty committee reviewed these results, concern was expressed 
that students are indeed passing the examination who make significant numbers
of CEs, indicating that they lack information and/or decision-making skills
which might put their patients at serious risk. In addition to extending the
study to a second group of examinees and reviewing the clerkship performance
of students who make a large number of CEs, other options being considered
include weighting CEs more heavily so that selecting a CE has a greater
negative effect on the student's overall score than is currently the case and
developing a CE subsection with a required minimum pass level for that
section. The examination committee has also prepared feedback for the
curriculum committees and department chairs to alert them to apparent
weaknesses in their instructional programs.
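
The two scoring options under consideration can be sketched as decision rules. The example below is purely illustrative: the CE ceiling of 10 and the penalty of 0.5 points per CE are hypothetical values chosen for the sketch, not decisions of the committee, and only the 62% minimum pass level comes from the study.

```python
from dataclasses import dataclass


@dataclass
class Examinee:
    total_pct: float      # conventional total score, in percent
    critical_errors: int  # number of CEs selected


MPL = 62.0        # minimum pass level reported in the study
MAX_CES = 10      # hypothetical ceiling for a CE subsection
CE_PENALTY = 0.5  # hypothetical extra points deducted per CE


def passes_conjunctive(e: Examinee) -> bool:
    """Pass only if the total score meets the MPL and the CE count is within the ceiling."""
    return e.total_pct >= MPL and e.critical_errors <= MAX_CES


def passes_penalized(e: Examinee) -> bool:
    """Pass if the total score, after an extra deduction per CE, still meets the MPL."""
    return e.total_pct - CE_PENALTY * e.critical_errors >= MPL


# A hypothetical barely passing student with many CEs.
student = Examinee(total_pct=64.0, critical_errors=12)
print(passes_conjunctive(student), passes_penalized(student))
```

Either rule would fail this hypothetical student, who passes under conventional scoring alone; how many actual examinees would be affected depends on the threshold chosen.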

TABLE 1

NUMBER OF CRITICAL ERRORS ACROSS ITEM TYPE

                                Regular     Weighted    Patient
                                Multiple    Multiple    Management
                                Choice      Choice      Problems      Total

Total Number of
Critical Errors                 55          18          35            108

Possible Number of
Critical Errors Per Examinee    21          14          28            63

TABLE 2

MEAN AND STANDARD DEVIATION OF CRITICAL ERRORS

                           Total
                           Group       Group 1     Group 2     Group 3     Group 4
                           (n = 279)   (n = 13)    (n = 23)    (n = 108)   (n = 135)

Regular Multiple           3.04        5.46        4.00        3.45        2.32
Choice Critical Errors     (1.69)      (1.66)      (1.71)      (1.60)      (1.33)

Weighted Multiple          0.77        1.15        1.00        0.90        0.59
Choice Critical Errors     (0.79)      (0.99)      (0.74)      (0.85)      (0.68)

Patient Management         1.96        3.23        2.83        2.21        1.49
Problem Critical Errors    (1.48)      (2.49)      (1.50)      (1.32)      (1.30)

Total Critical             5.77        9.84        7.83        6.56        4.39
Errors                     (2.66)      (3.34)      (2.42)      (2.30)      (1.93)

Group 1 = Total test score less than 62%
Group 2 = Total test score 62-65%
Group 3 = Total test score 66-72%
Group 4 = Total test score 73% and above

TABLE 3

CORRELATIONS

Correlations between Critical Errors and Test Scores

                 RMC CEs     WMC CEs     PMP CEs     Total CEs

RMC Score        -.50        -.17        -.30        -.53
                 p=.000      p=.003      p=.000      p=.000

WMC Score        -.29        -.27        -.13        -.34
                 p=.000      p=.010      p=.014      p=.000

PMP Score        -.24        -.18        -.59        -.53
                 p=.000      p=.001      p=.000      p=.000

Total Score      -.46        -.22        -.41        -.59
                 p=.000      p=.000      p=.000      p=.000

Correlations between Number of Critical Errors on Different Item Formats

                 WMC CEs     PMP CEs     Total CEs

RMC CEs          .12         .16         .76
                 p=.024      p=.003      p=.000

WMC CEs                      .12         .44
                             p=.022      p=.000

PMP CEs                                  .70
                                         p=.000

Correlations between Test Scores on Different Item Formats

                 WMC Score   PMP Score   Total Score

RMC Score        .55         .52         .93
                 p=.000      p=.000      p=.000

WMC Score                    .35         .72
                             p=.000      p=.000

PMP Score                                .72
                                         p=.000

TABLE 4

ANOVA RESULTS

Critical Errors on Regular Multiple Choice Questions

Source            D.F.   Sum of Squares   Mean Squares   F Ratio   F Prob.
Between Groups       3         186.1804        62.0601    28.195    0.000
Within Groups      275         605.2997         2.2011
Total              278         791.4800

Critical Errors on Weighted Multiple Choice Questions

Source            D.F.   Sum of Squares   Mean Squares   F Ratio   F Prob.
Between Groups       3           9.5143         3.1714     5.307    0.0014
Within Groups      275         164.3408         0.5976
Total              278         173.8551

Critical Errors on Patient Management Problems

Source            D.F.   Sum of Squares   Mean Squares   F Ratio   F Prob.
Between Groups       3          75.1189        25.0396    12.860    0.0000
Within Groups      275         535.4454         1.9471
Total              278         610.5642

Total Number of Critical Errors

Source            D.F.   Sum of Squares   Mean Squares   F Ratio   F Prob.
Between Groups       3         637.5836       212.5279    43.821    0.0000
Within Groups      275        1333.7198         4.8499
Total              278        1971.3025